Collecting and Evaluating Speech Recognition Corpora for Nine Southern Bantu Languages
Abstract
We describe the Lwazi corpus for automatic speech recognition (ASR), a new telephone speech corpus which includes data from nine Southern Bantu languages. Because of practical constraints, the amount of speech per language is relatively small compared to major corpora in world languages, and we report on our investigation of the stability of the ASR models derived from the corpus. We also report on phoneme distance measures across languages, and describe initial phone recognisers that were developed using this data.
Similar resources
EACL 2009 Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages
We describe the Lwazi corpus for automatic speech recognition (ASR), a new telephone speech corpus which includes data from nine Southern Bantu languages. Because of practical constraints, the amount of speech per language is relatively small compared to major corpora in world languages, and we report on our investigation of the stability of the ASR models derived from the corpus. We also repor...
Pooling ASR data for closely related languages
We describe several experiments that were conducted to assess the viability of data pooling as a means to improve speech-recognition performance for under-resourced languages. Two groups of closely related languages from the Southern Bantu language family were studied, and our tests involved phoneme recognition on telephone speech using standard tied-triphone Hidden Markov Models. Approximately...
Collecting and evaluating speech recognition corpora for 11 South African languages
We describe the Lwazi corpus for automatic speech recognition (ASR), a new telephone speech corpus which contains data from the eleven official languages of South Africa. Because of practical constraints, the amount of speech per language is relatively small compared to major corpora in world languages, and we report on our investigation of the stability of the ASR models derived from the corpu...
Phonetics of intonation in South African Bantu languages
Much is already known about the prosodic systems of the indigenous South African languages from descriptions and analyses in the existing literature. All of the existing work has been carried out in the field of African studies or formal linguistics. In order to be able to implement the generalisations obtained into computational models in speech processing, the existing sources and results mus...
The Consequences of the Contacts between Bantu and Non-Bantu Languages around Lake Eyasi in Northern Tanzania
In rural Tanzania, recent major influences happen between Kiswahili and English to ethnic languages rather than ethnic languages, which had been in contact for so long, influencing each other. In this work, I report the results of investigation of lexical changes in indigenous languages that aimed at examining how ethnic communities and their languages, namely Cushitic Iraqw, Nilotic Datooga, N...